Introduction to Open Data Science - Course Project

About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Wed Nov 23 17:56:41 2022"

The text continues here.

I mostly work with Python nowadays and was using R shortly some time ago. I suppose one of most “challenging” problems for me in this course will probably be debugging in R.

My main goal of this course is to improve my visualization skills. Also, it is always good to review some basic knowledge about data science including regression and clustering, as well as review some common function call or tricks in R.

The repo of this course dairy can be found in https://github.com/hj940709/IODS-project


Regression and model validation

date()
## [1] "Wed Nov 23 17:56:41 2022"
set.seed(0)

Load data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.5
## ✔ tibble  3.1.8     ✔ stringr 1.4.1
## ✔ tidyr   1.2.1     ✔ forcats 0.5.2
## ✔ readr   2.1.3     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)
learning2014 <- read_csv('data/learning2014.csv')
## Rows: 166 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): gender
## dbl (6): age, attitude, deep, stra, surf, points
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(learning2014)
## spc_tbl_ [166 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ gender  : chr [1:166] "F" "M" "F" "M" ...
##  $ age     : num [1:166] 53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num [1:166] 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num [1:166] 3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num [1:166] 3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num [1:166] 2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : num [1:166] 25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   gender = col_character(),
##   ..   age = col_double(),
##   ..   attitude = col_double(),
##   ..   deep = col_double(),
##   ..   stra = col_double(),
##   ..   surf = col_double(),
##   ..   points = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

This is a dataset collected from an exam. It has 166 observations (row) and 7 variables (col). The variables are gender, age, attitude, deep, surf, stra, and points. I will skip the meaning of gender, age, attitude and points as their meanings are quite straightforward. The meaning of rest variables is listed as following below: * deep: Deep approach * surf: Surface approach * stra: Strategic approach

The value of these variables are mostly numbers, either integer or float, except gender, which is filled with two types of characters: “F” and “M”.

Graphical overview and variable summary

p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

p

With the code above, we are able to display the correlation between variables. As we can see from the figure, attitude and points have the most significant correlation. I suppose this makes sense in general. Both stra and surf also have a strong correlation but surf has a negative correlation.

Fit a linear model and its summary

model <- lm(points ~ attitude + stra + surf, data = learning2014)

summary(model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

As what is shown in the previous figure, I select attitude, stra and surf to be the explanatory variable. According to the summary, the coefficient for attitude, stra and surf is 3.3952, 0.8531 and -0.5861 respectively. The intercept of this model is 11.0171.

The test of this model shows that the residual is distributed within -17.1550 and 10.9852 and the mean of residual is 0.5156.

Explain and interpret the multiple R-squared of the model

The summary of this model shows that attitude is the most significant variable in this model (Pr=1.93e-08), while the other two are less significant (Pr>0.01). This aligns with the coefficients of these three explanatory variables, where the coefficient of attitude is the highest absolute value.

The multiple R-squared of this model is 0.207. It is not a high value, which suggests the model is not fully explainable by attitude, stra, and surf. Considering I am using explanatory variables that are most correlated with points, it means we either need to consider a more complicated model rather than simple linear regression or there are other missing explanatory variables that are not available from the dataset.

Diagnostic plots

The assumption of the simple linear model is that all explanatory variables follow a linear relationship with the target variable. The observations and residual should be independent. The residual should also follow the normal distribution.

  • Residuals vs Fitted: From this plot, we can see the red curve fluctuate around the perfect horizontal line. Scatter dots #35, #56 and #145 seem to be outliers in this plot.
plot(model, which = 1)

  • Normal QQ-plot: In this plot, we can see the majority of the dots fall along the straight diagonal line. There are a small number of observations that deviate from the line. #35, #56, and #145 seem to deviate the most. We can claim that the residual mostly follows the normal distribution.
plot(model, which = 2)

  • Residuals vs Leverage: This plot is supposed to show influential observations according to Cook’s distance. However, I don’t there is a curve for this metric. I suppose it means there aren’t any influential observations in this dataset.
plot(model, which = 5)

In conclusion, even though the model is not fully explainable by attitude, stra, and surf. The plots here suggest it is still appropriate to assume these explanatory variables follow a linear patterns in terms of the target variable.


Logistic regression

Describe the work you have done this week and summarize your learning.

date()
## [1] "Wed Nov 23 17:56:50 2022"
set.seed(0)
library(dplyr)
library(tidyverse)
library(GGally)
library(ggplot2)
library(boot)

Load data

alc <- read_csv('data/alc.csv')
## Rows: 370 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
## dbl (17): age, Medu, Fedu, traveltime, studytime, famrel, freetime, goout, D...
## lgl  (1): high_use
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
glimpse(alc)
## Rows: 370
## Columns: 35
## $ school     <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP",…
## $ sex        <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "F",…
## $ age        <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
## $ address    <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U",…
## $ famsize    <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", "LE…
## $ Pstatus    <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "T",…
## $ Medu       <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
## $ Fedu       <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
## $ Mjob       <chr> "at_home", "at_home", "at_home", "health", "other", "servic…
## $ Fjob       <chr> "teacher", "other", "other", "services", "other", "other", …
## $ reason     <chr> "course", "course", "other", "home", "home", "reputation", …
## $ guardian   <chr> "mother", "father", "mother", "mother", "father", "mother",…
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
## $ schoolsup  <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no", "n…
## $ famsup     <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "yes",…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", "ye…
## $ nursery    <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "yes…
## $ higher     <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", "ye…
## $ internet   <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "yes",…
## $ romantic   <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "no"…
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
## $ paid       <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes", …
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9, 5,…
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, 16,…
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15, 16…
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16, 1…
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1.0,…
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…

This is a dataset combined from two open dataset, which are both collected from two Portuguese schools by school reports and questionnaires. The meta data of of the original dataset can be found from here. Two extra variables has been added to the dataset alc:

  • alc_use: the average of Dalc and Walc
  • high_use: TRUE if alc_use is higher than 2 and FALSE otherwise

Overall, the dataset alc contains 370 observations (rows) with 35 variables (columns), which have been printed out above.

4 variables and hypotheses

I would like to choose the following variables to study their relationship with alcohol consumption:

  • age: Adult students are legally allowed to buy and drink alcohol, while minors are not allowed.
  • freetime: A student who have more time after school may consume more alcohol
  • goout: A student who going out more often with friends are more likely to be in a party or a similar situation. This means such a student may consume more alcohol.
  • famrel: Parents will usually guide their kids well and prevent them from bad lifestyle. But a student with a terrible family relationship may easily have some bad habits. One of the problems may be alcohol addiction.

Expore the distributions of my chosen variables and their relationships with alcohol consumption

alc %>% group_by(age) %>% summarise(alcohol_consumption = mean(alc_use))
## # A tibble: 7 × 2
##     age alcohol_consumption
##   <dbl>               <dbl>
## 1    15                1.62
## 2    16                1.85
## 3    17                2.04
## 4    18                2   
## 5    19                1.91
## 6    20                1   
## 7    22                5
ggplot(alc, aes(y = alc_use, x = age, group = age)) + geom_boxplot()

The table above shows there are 7 groups from 15 years old to 22 years old in the dataset. From this table, we can see there is a positive correlation between age and average alcohol consumption, except age group 19 and 20. This is generally aligned with the box chart I made, which shows that the distribution of alcohol consumption goes up when students getting older. This means my hypothesis about age and alcohol consumption generally holds.

alc %>% group_by(freetime) %>% summarise(alcohol_consumption = mean(alc_use))
## # A tibble: 5 × 2
##   freetime alcohol_consumption
##      <dbl>               <dbl>
## 1        1                1.56
## 2        2                1.72
## 3        3                1.79
## 4        4                2.05
## 5        5                2.28
ggplot(alc, aes(y = alc_use, x = freetime, group = freetime)) + geom_boxplot()

This table shows a positive correlation between freetime and average alc_use as well. The box chart shows that the distribution of alc_use are more spread out and reach to a higher value, if there is a higher freetime.

alc %>% group_by(goout) %>% summarise(alcohol_consumption = mean(alc_use))
## # A tibble: 5 × 2
##   goout alcohol_consumption
##   <dbl>               <dbl>
## 1     1                1.41
## 2     2                1.56
## 3     3                1.71
## 4     4                2.14
## 5     5                2.73
ggplot(alc, aes(y = alc_use, x = goout, group = goout)) + geom_boxplot()

Similar as the previous two variables, the result shows there is a positive correlation between goout and average alc_use. This is also confirmed with the box plot. My hypothesis about students who goes out more with friends may consume more alcohol generally holds true.

alc %>% group_by(famrel) %>% summarise(alcohol_consumption = mean(alc_use))
## # A tibble: 5 × 2
##   famrel alcohol_consumption
##    <dbl>               <dbl>
## 1      1                2.25
## 2      2                2.19
## 3      3                2   
## 4      4                1.88
## 5      5                1.74
ggplot(alc, aes(y = alc_use, x = famrel, group = famrel)) + geom_boxplot()

This table shows a negative correlation between famrel and average alc_use, which suggests a worse family relationship can generally leads to more alcohol consumption. The box plot aligns with my hypothesis as well. Generally, the worse a famrel is, the higher value alc_use is distributed. Dispite the mean, upper quartile and upper whisker of the worst famrel box is lower than the second worst one, its lower quatile is the highest, which may still suggest my hypothesis holds true.

Logistic regression

model <- glm(high_use ~ age + freetime + goout + famrel, data = alc, family = "binomial")
summary(model)
## 
## Call:
## glm(formula = high_use ~ age + freetime + goout + famrel, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6976  -0.8063  -0.5542   0.9595   2.4772  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -4.3760     1.8489  -2.367  0.01794 *  
## age           0.1279     0.1069   1.196  0.23160    
## freetime      0.2089     0.1369   1.526  0.12695    
## goout         0.7208     0.1241   5.807 6.38e-09 ***
## famrel       -0.4256     0.1376  -3.092  0.00199 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 391.01  on 365  degrees of freedom
## AIC: 401.01
## 
## Number of Fisher Scoring iterations: 4

As what the model summary shows, goout and famrel are the two most significant variables when predicting high_use. The Pr(>|t|) of these two variables are 6.38e-09 and 0.00199 respectively. The other two variables age and freetime are less significant.

OR <- coef(model) %>% exp
CI <- confint(model)
## Waiting for profiling to be done...
cbind(OR, CI)
##                     OR       2.5 %     97.5 %
## (Intercept) 0.01257509 -8.05653672 -0.7821400
## age         1.13647295 -0.08120815  0.3394566
## freetime    1.23230323 -0.05804507  0.4799409
## goout       2.05616312  0.48382325  0.9716157
## famrel      0.65340188 -0.69939890 -0.1577806

According to the odd ratio tables, we can see that goout has a strong positive connection with high_use (OR > 2), while famrel has a strong negative relationship with high_use (OR < 1). age and freetime also have a positive relationship with high_use. But they are less significant as goout. This suggests my previous hypothesis generally holds true

Explore the predictive power of my model

alc <- mutate(alc, probability = predict(model, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

g <- ggplot(alc, aes(x = high_use, y = probability, col = prediction))
g + geom_point()

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.64594595 0.05405405 0.70000000
##    TRUE  0.19189189 0.10810811 0.30000000
##    Sum   0.83783784 0.16216216 1.00000000
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}



loss_func(class = alc$high_use, prob = alc$prediction)
## [1] 0.2459459
loss_func(class = alc$high_use, prob = runif(nrow(alc), 0, 1) > 0.5) # Random guessing
## [1] 0.4837838

The training error is 24.6%, which is lower than fliping a coin without any proof and guessing 1 for all observations. This means the variables I selected and the model I trained are effective. However, guessing 0 for all achieves 30% error rate, which is close to my result. This suggest the data itself is not balanced, but it also makes sense because high school students are generally less likely to have a drinking problem.

Bonus: 10-fold cross-validation

model_2 <- glm(high_use ~ absences + failures + goout + famrel, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model_2, K = 10)
cv$delta[1]
## [1] 0.2540541

I replaced the two least significant variables in my initial model with the two most significant variables in the model introduced in the Exercise Set. The result is slightly better than the result in Exercise Set

Super-Bonus: Compare the performance of different logistic regression models

all_predictors <- rev(c('goout', 'absences', 'failures', 'famrel', 'health', 'sex'))
num_predictors <- length(all_predictors):1

train_losses <- c()
test_losses <- c()

for( i in num_predictors){
    predictors <- all_predictors[1:i]
    f <- as.formula(paste('high_use', paste(predictors, collapse = ' + '), sep = ' ~ '))
    glm <- glm(f, data = alc, family = "binomial")
    alc <- mutate(alc, probability = predict(glm, type = "response"))
    alc <- mutate(alc, prediction = probability > 0.5)
    train_loss <- loss_func(class = alc$high_use, prob = alc$prediction)
    train_losses <- append(train_losses, train_loss)

    cv <- cv.glm(data = alc, cost = loss_func, glmfit = glm, K = 10)
    test_loss <- cv$delta[1]
    test_losses <- append(test_losses, test_loss)
    
}


df = data.frame(num_predictors, train_losses, test_losses)
ggplot(df, aes(num_predictors)) +                    # basic graphical object
  geom_line(aes(y=train_losses, colour="Train")) +  # first layer
  geom_line(aes(y=test_losses, colour="Test")) +  # second layer
  scale_color_manual(name = "Losses", values = c("Train" = "darkblue", "Test" = "red")) + 
  ylab("Loss")

As what we can see from the plot above, both training loss and test loss drop when there are more variables get involved.


Clustering and classification

date()
## [1] "Wed Nov 23 17:56:53 2022"
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(tidyverse)
library(corrplot)
## corrplot 0.92 loaded
library(ggplot2)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Load data

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

This is the Boston dataset coming from MASS package of R. The dataset records the statistics related to the housing values in suburbs of Boston. As what we can see from above, it contains 506 observations and 14 variables. All variables is recorded as a number. Most of variables are float numbers, except chas and rad are intergers. The description of each variables can be found from here.

Graphical overview and variable summary

cor_matrix <- cor(Boston) %>% round(digits=2)
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

The plot above is a visualization of the correlation matrix of variables in the Boston dataset. The plot summarize the interrelationship among all the variables of the dataset. As what we can see from the plot, most of variables are more or less correlated with each other, except chas. It seems chas is independant from the rest of variables. indus is probably the variable with the most correlation with other variables (except chas). It has a strong positive correlation with nox and a stong negative correlation with dis.

Standardize the dataset and scaled data summary

After standardizing, the variables will be scaled so that the mean of each variable is zero and the standard deviation is one. The summary of the scaled dataset is listed below:

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = quantile(boston_scaled$crim), include.lowest = TRUE, label=c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

n <- nrow(boston_scaled) # number of rows in the Boston dataset
ind <- sample(n,  size = n * 0.8) # choose randomly 80% of the rows
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]

Fit an LDA and draw its biplot

lda.fit <- lda(crime ~ ., data = train) # linear discriminant analysis

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

Validate LDA

correct_classes <- test$crime # Save the crime categories from the test set
test <- dplyr::select(test, -crime) # remove the categorical crime variable
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       15       9        2    0
##   med_low    6      17        2    0
##   med_high   1       6       17    4
##   high       0       0        1   22

It seems the LDA model runs quite well on the test set, where most of observations are classified into the correct categories. The error rate of this LDA is around 23%

K-means

set.seed(0)
km <- kmeans(Boston, centers = 3)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(Boston) # Euclidean distance of Boston
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

The plot above suggests the optimal number of clusters is 2, as WCSS drops the most at 2.

km <- kmeans(Boston, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

The plot above shows how K-means cluster the dataset. Clusters are colored into red and black. It seems K-means works the best on tax and rad, as there is an obvious separation between two cluster.

Bonus: Visualize a biplot of LDA based on K-means clustering

data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
km <- kmeans(Boston, centers = 3)

lda.fit <- lda(km$cluster ~ ., data = boston_scaled)
classes <- as.numeric(km$cluster)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

The plot above is biplot of K-means clustering for Boston dataset in two dimensional space. The number of cluster is 3. It seems K-means cluster the dataset fairly well. According to the plot above, tax and rad have more variation than the rest variables. Also, these two variables are almost independant to each other.

Super-Bonus: Visualize and compare LDA and K-means in 3D

lda.fit <- lda(crime ~ ., data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~train$crime)

I tried running with K-means twice with different number of clusters: 2 (the optimal number of clustering) and 4 (the number of categories for crime).

km = kmeans(model_predictors, centers = 2)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~factor(km$cluster))
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels

When the number of clusters is 2, the 3D plot above shows that the K-means does a fairly good job as what we expected. The main body of two cluster are obvious. Comparing to the plot which colors observations according to crime, it seems one cluster stands for high, while the other one include all the rest of categories. Despite some observations are clustered into the high cluster, they are located much closer to the other cluster in 3D space. It seems that those observations mostly belong to med_high. I think the clusering makes sense considering med_high observations are logically more close to high observations.

km = kmeans(model_predictors, centers = 4)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=~factor(km$cluster))

When the number of clusters is 4, the plot above seems to be more similar to the one colored by crime. Yet, there are observations that are mis-classified into another cluster. The most obvious ones are mis-classified between high cluster and med_high cluster, which is plausible considering they are logically next to each other. Other mis-classified observations are mostly located at the border of each cluster. I suppose those observations are quite difficult for the algorithm.


Dimensionality reduction techniques

date()
## [1] "Wed Nov 23 17:56:59 2022"
set.seed(0)
library(MASS)
library(dplyr)
library(tidyverse)
library(corrplot)
library(ggplot2)
library(plotly)
library(GGally)
library(corrplot)
library(FactoMineR)

human <- read.csv('data/human.csv', row.names = 1)

Graphical data overview

ggpairs(human)

cor(human) %>% corrplot(method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex=0.6)

The plots above are the graphical overview of the dataset. From the correlation plot, we can see a strong positive correlation between Edu.Exp and Life.Exp. This means people who lives in countries with longer life expectation tend to have a better education. The correlation between Mat.Mor and Ado.Birth are strong and positive as well, which suggests female who lives in countries with higher Maternal mortality ratio are also more likely to get pregnant in their adolescence. There are also strong negative correlations in this data such as Life.Exp and Mat.Mor, which makes sense as a higher mortality ratio naturally leads to a lower life expectancy on average.

PCA

PCA on non-standardized data

pca_human <- prcomp(human)
summary(pca_human)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

PCA on standardized data

pca_human <- prcomp(scale(human))
summary(pca_human)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

As we can see from the two plots above, their results are quite different. The first component of the PCA on non-standardized data captured 99% of the variance. We can infer from its biplot that the PC1 is mostly aligned with GNI, which means its variance will overwhelm the PCA if not standardized. This makes sense as the actual value of GNI is huge if compared with other variables.

Inspecting the plot, we can see a much clearer picture of the data. Edu.Exp, Edu2.PM, GNI and Life.Exp have a very strong positive correlation among them, while they all have a strong negative correlation with both Mat.Mor and Ado.Birth. Parli.F and Labo.FM are strongly and positively correlated with each other but not very correlated with the rest of the variables. This agrees with my previous correlation matrix visualization.

interpretation of PCs

Inspecting the previous plot for PCA with standardized data, it is noticeable that PC1 is mostly related to how developed a country is. Edu.Exp, Edu2.PM, Life.Exp, Mat.Mor and Ado.Birth evaluate if the people of a country have been taken good care of. Furthermore, GNI is a direct metric of the economics of a country.

On the other hand, PC2 seems to be mostly related to female living situations. I think it confirms that females in a more developed country may have more working opportunities (Labo.FM) and better political right (Parli.F).

MCA

I select the same columns and visualize them according to the Exercise Set, which means there are 6 variables in my dataset: Tea, How, how, sugar, where, lunch. The data is visualizd below:

data(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, keep_columns)
## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(keep_columns)
## 
##   # Now:
##   data %>% select(all_of(keep_columns))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               0.279   0.261   0.219   0.189   0.177   0.156   0.144
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519   7.841
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953  77.794
##                        Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.141   0.117   0.087   0.062
## % of var.              7.705   6.392   4.724   3.385
## Cumulative % of var.  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139   0.003
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626   0.027
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111   0.107
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841   0.127
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979   0.035
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990   0.020
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347   0.102
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459   0.161
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968   0.478
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898   0.141
##                     v.test     Dim.3     ctr    cos2  v.test  
## black                0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            2.867 |   0.433   9.160   0.338  10.053 |
## green               -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone               -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                3.226 |   1.329  14.771   0.218   8.081 |
## milk                 2.422 |   0.013   0.003   0.000   0.116 |
## other                5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag             -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged          -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
plot(mca, invisible=c("ind"),habillage = "quali")

MCA analyze qualitative data and place category in a euclidean space so that people can visualize pattern of different categories according to their euclidean distance. As we can see in the summary, MCA generated a 11D space, where the first two dimensions capture 15.238% and 14.232% of the total variance.

The plot of MCA visualized the relationships among different categories. Each color represents a variable. We can observe some interesting pattern from this plot:

  • People tend to put milk and suger when drinking Earl Grey but not put suger when drinking black tea.
  • People tend to buy tea bags from a chain store. But people are more likely to buy unpacked tea when shopping in a tea shop.